SSPS4102 Data Analytics in the Social Sciences
SSPS6006 Data Analytics for Social Research
Semester 1, 2026
Last updated: 2026-01-23
I would like to acknowledge the Traditional Owners of Australia and recognise their continuing connection to land, water and culture. The University of Sydney is located on the land of the Gadigal people of the Eora Nation. I pay my respects to their Elders, past and present.
By the end of this lecture, you will be able to:

- Explain the fundamental problem of causal inference and the potential outcomes framework
- Use DAGs to decide which variables to control for (confounders, mediators, colliders)
- Apply difference-in-differences, propensity score matching, regression discontinuity, and instrumental variables to observational data
Readings: TSwD Ch 14; ROS Ch 18-19
The Fundamental Problem of Causal Inference
We can never observe both potential outcomes for the same unit—what happened AND what would have happened under a different treatment.
In randomised experiments, random assignment ensures treatment and control groups are comparable.
But what if we cannot randomise? We need methods to estimate causal effects from observational data.
For any unit \(i\), we define two potential outcomes:

- \(y_i^0\): the outcome unit \(i\) would experience without treatment
- \(y_i^1\): the outcome unit \(i\) would experience with treatment

The problem: We only ever observe ONE of these potential outcomes:
\[y_i = y_i^0(1 - z_i) + y_i^1 z_i\]
where \(z_i\) indicates treatment assignment.
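This observation rule can be checked with a tiny base-R sketch (the numbers are hypothetical):

```r
# Hypothetical potential outcomes for three units
y0 <- c(10, 12, 9)    # outcome without treatment
y1 <- c(14, 13, 11)   # outcome with treatment
z  <- c(1, 0, 1)      # treatment assignment

# Observed outcome: y_i = y_i^0 (1 - z_i) + y_i^1 z_i
y <- y0 * (1 - z) + y1 * z
y  # 14 12 11: we see y1 for treated units, y0 for controls

# The individual effects y1 - y0 exist on paper,
# but we never observe both columns for the same unit
```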
We typically estimate average causal effects:
| Estimand | Definition |
|---|---|
| SATE (Sample Average Treatment Effect) | \(\frac{1}{n}\sum_{i=1}^{n}(y_i^1 - y_i^0)\) |
| PATE (Population Average Treatment Effect) | \(\frac{1}{N}\sum_{i=1}^{N}(y_i^1 - y_i^0)\) |
| CATE (Conditional Average Treatment Effect) | Average effect for a subgroup |
Self-Selection Bias
When treatment groups differ systematically in ways that also affect the outcome, simple comparisons are misleading.
Example: Comparing outcomes of people who chose to take supplements vs. those who didn’t.
Directed Acyclic Graphs (DAGs) are visual representations of causal relationships: nodes represent variables, and arrows represent direct causal effects. "Acyclic" means no variable can cause itself, directly or indirectly.
A confounder is a variable that causes both the treatment and the outcome.

Confounders must be controlled for!
Failing to adjust for confounders creates a “backdoor path” that biases our causal estimate.
A mediator is a variable that lies on the causal pathway between treatment and outcome: the treatment affects the mediator, which in turn affects the outcome.
Do NOT control for mediators!
Controlling for a mediator blocks part of the causal effect you’re trying to estimate.
A collider is a variable caused by both the treatment and the outcome.
Do NOT control for colliders!
Controlling for a collider opens a spurious path and creates bias where none existed.
| Variable Type | Relationship | Control for it? |
|---|---|---|
| Confounder | Causes both treatment and outcome | ✓ Yes |
| Mediator | On the causal path from treatment to outcome | ✗ No |
| Collider | Caused by both treatment and outcome | ✗ No |
Key Insight
More controls are not always better! You must think carefully about the causal structure.
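The control-or-not logic in the table above can be checked mechanically with `dagitty` (loaded later in the packages list). This is a sketch using a hypothetical DAG with one confounder C, one mediator M, and one collider K:

```r
library(dagitty)

# X -> Y, with a confounder C, a mediator M, and a collider K
dag <- dagitty("dag {
  X -> Y
  C -> X ; C -> Y
  X -> M ; M -> Y
  X -> K ; Y -> K
}")

# Which variables must we adjust for to identify the effect of X on Y?
adjustmentSets(dag, exposure = "X", outcome = "Y")
# Minimal adjustment set: { C }, the confounder only;
# the mediator and the collider are correctly left alone
```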
Difference-in-differences compares:

- the change in the outcome over time in the treatment group, with
- the change in the outcome over the same period in the control group.
The treatment effect = (difference in treatment group) − (difference in control group)
\[\hat{\tau}_{DiD} = (\bar{Y}_{T,after} - \bar{Y}_{T,before}) - (\bar{Y}_{C,after} - \bar{Y}_{C,before})\]
Critical Assumption
In the absence of treatment, the treatment and control groups would have followed parallel trends over time.
This assumption cannot be tested directly, because it concerns a counterfactual we never observe. At best, we can check that pre-treatment trends in the two groups look parallel.
set.seed(853)
n <- 1000
# Simulate DiD data
did_sim <- tibble(
person = rep(1:n, 2),
time = rep(c(0, 1), each = n),
treated = rep(sample(0:1, n, replace = TRUE), 2)
) |>
mutate(
# Outcome depends on time, treatment group, and their interaction
outcome = 5 +
2 * time + # Time trend
3 * treated + # Group difference
4 * time * treated + # Treatment effect!
rnorm(n * 2)
)

The DiD model is a regression with an interaction term:
\[Y_{it} = \beta_0 + \beta_1 \cdot \text{Time}_t + \beta_2 \cdot \text{Treatment}_i + \beta_3 \cdot (\text{Time} \times \text{Treatment})_{it} + \epsilon_{it}\]
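Fitting this model to the `did_sim` data simulated above is a single `lm()` call; in R's formula syntax, `time * treated` expands to both main effects plus their interaction (a minimal sketch):

```r
# DiD as a regression: time * treated expands to
# time + treated + time:treated
did_model <- lm(outcome ~ time * treated, data = did_sim)

# The coefficient on time:treated is the DiD estimate
coef(did_model)["time:treated"]  # close to the true effect of 4
```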
|  | (1) |
|---|---|
| (Intercept) | 5.041 |
| (0.043) | |
| Time | 2.015 |
| (0.061) | |
| Treatment Group | 2.986 |
| (0.062) | |
| DiD Effect (Time × Treatment) | 3.897 |
| (0.088) | |
| Num.Obs. | 2000 |
| R2 | 0.917 |
| R2 Adj. | 0.917 |
Threats to validity:

- Violations of parallel trends (shocks that hit only one group)
- Anticipation effects (units change behaviour before treatment begins)
- Other events coinciding with the timing of the treatment
Best Practice
Always visualise pre-treatment trends and discuss why parallel trends is plausible.
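With the simulated `did_sim` data there is only one pre-treatment period, so a trends plot can show group means but not a genuine pre-trends check; real applications need several pre-treatment periods. A sketch of the usual plot, assuming `dplyr` and `ggplot2`:

```r
library(dplyr)
library(ggplot2)

# Mean outcome per group and period
trend_data <- did_sim |>
  group_by(time, treated) |>
  summarise(mean_outcome = mean(outcome), .groups = "drop")

# One line per group; parallel pre-treatment lines support the assumption
ggplot(trend_data,
       aes(x = time, y = mean_outcome,
           colour = factor(treated), group = treated)) +
  geom_line() +
  geom_point() +
  labs(x = "Period", y = "Mean outcome", colour = "Treated") +
  theme_minimal()
```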
Goal: Create treatment and control groups that are similar on observed characteristics.
Problem: With many covariates, exact matching is impossible.
Solution: Match on a single number—the propensity score.
\[e(X) = P(\text{Treatment} = 1 | X)\]
The probability of receiving treatment given observed covariates.
set.seed(853)
n <- 1000
psm_data <- tibble(
id = 1:n,
age = sample(18:65, n, replace = TRUE),
income = rnorm(n, 50000, 15000)
) |>
mutate(
# Treatment probability depends on age and income
prop_score = plogis(-2 + 0.03 * age + 0.00003 * income),
treatment = rbinom(n, 1, prop_score),
# Outcome depends on treatment AND confounders
outcome = 10 + 5 * treatment + 0.1 * age + 0.0001 * income + rnorm(n, 0, 5)
)

Naive estimate: 5.19
True effect: 5
Warning
The naive estimate is biased because treated and control groups differ on age and income!
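The matched analysis summarised below can be produced with `MatchIt` along these lines: 1:1 nearest-neighbour matching without replacement, on a propensity score from logistic regression (a sketch using the `psm_data` simulated above):

```r
library(MatchIt)

# 1:1 nearest-neighbour matching on the propensity score
match_out <- matchit(
  treatment ~ age + income,
  data = psm_data,
  method = "nearest",   # nearest-neighbour, without replacement
  distance = "glm"      # propensity score via logistic regression
)

# Outcome model on the matched sample
matched <- match.data(match_out)
psm_model <- lm(outcome ~ treatment + age + income, data = matched)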
A `matchit` object
- method: 1:1 nearest neighbor matching without replacement
- distance: Propensity score
- estimated with logistic regression
- number of obs.: 1000 (original), 686 (matched)
- target estimand: ATT
- covariates: age, income
|  | (1) |
|---|---|
| (Intercept) | 9.843 |
| (1.017) | |
| Treatment Effect | 4.028 |
| (0.480) | |
| age | 0.094 |
| (0.016) | |
| income | 0.000 |
| (0.000) | |
| Num.Obs. | 686 |
| R2 | 0.364 |
| R2 Adj. | 0.361 |
Key Limitations
“Propensity score matching cannot match on unobserved variables… it is difficult to understand why individuals that appear to be so similar would have received different treatments, unless there is something unobserved.” — TSwD
Regression Discontinuity Design exploits situations where treatment is assigned based on a cutoff in a continuous variable (the “running variable” or “forcing variable”).
Examples: a scholarship awarded to students scoring at or above a mark threshold, a benefit available only below an income cutoff, a policy applying only above an age limit.
Key insight: People just above and just below the cutoff are essentially identical, except for treatment!
The basic RDD model:
\[Y_i = \alpha + \tau \cdot \text{Treatment}_i + \beta \cdot \text{RunningVar}_i + \epsilon_i\]
Or, centring the running variable at the cutoff \(c\):
\[Y_i = \alpha + \tau \cdot D_i + \beta \cdot (X_i - c) + \epsilon_i\]
where \(D_i = 1\) if \(X_i \geq c\).
The coefficient \(\tau\) is the treatment effect at the cutoff.
# Centre the running variable
rdd_sim <- rdd_sim |>
mutate(mark_centred = mark - 80)
# RDD regression
rdd_model <- lm(future_score ~ got_scholarship + mark_centred,
data = rdd_sim)
modelsummary(rdd_model,
coef_rename = c("got_scholarship" = "Scholarship Effect",
"mark_centred" = "Mark (centred)"),
gof_omit = "IC|Log|F|RMSE")

|  | (1) |
|---|---|
| (Intercept) | 80.122 |
| (0.207) | |
| Scholarship Effect | 7.961 |
| (0.368) | |
| Mark (centred) | 0.487 |
| (0.032) | |
| Num.Obs. | 1000 |
| R2 | 0.833 |
| R2 Adj. | 0.832 |
Key Assumptions

- No manipulation: units cannot precisely control which side of the cutoff they fall on.
- Continuity: all other determinants of the outcome vary smoothly at the cutoff.

Check for manipulation: Look for bunching just above or below the cutoff.
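The bunching check can be automated with `rddensity` (listed in the packages later). A sketch on the scholarship example, assuming the `mark` column of the `rdd_sim` data simulated earlier:

```r
library(rddensity)

# Density (McCrary-style) test: does the density of marks jump at 80?
dens_test <- rddensity(X = rdd_sim$mark, c = 80)
summary(dens_test)
# A significant discontinuity in the density suggests manipulation
```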
| Type | Description |
|---|---|
| Sharp RDD | Treatment perfectly determined by cutoff (everyone above cutoff is treated) |
| Fuzzy RDD | Cutoff changes probability of treatment but doesn’t guarantee it |
Fuzzy RDD requires instrumental variables methods (using the cutoff as an instrument).
Sometimes we have:

- a treatment and an outcome of interest, but
- unobserved confounding that we cannot measure or control for.

Solution: Find an instrumental variable that:

1. affects the treatment (relevance), and
2. affects the outcome only through the treatment (the exclusion restriction).
The instrument provides variation in treatment that is unrelated to the confounder.
Question: Does smoking cause lung cancer?
Problem: Smokers differ from non-smokers in many ways (confounders).
Solution: Use cigarette taxes as an instrument.
Stage 1: Regress treatment on instrument
\[\text{Smoking}_i = \alpha_1 + \gamma \cdot \text{Tax}_i + \epsilon_{1i}\]
Stage 2: Regress outcome on predicted treatment
\[\text{Cancer}_i = \alpha_2 + \beta \cdot \widehat{\text{Smoking}}_i + \epsilon_{2i}\]
The coefficient \(\beta\) is the causal effect.
set.seed(853)
n <- 2000
iv_sim <- tibble(
# Instrument: tax rate (varies by province)
tax_rate = sample(c(0.3, 0.4, 0.5), n, replace = TRUE),
# Unobserved confounder
health_conscious = rnorm(n),
# Treatment: affected by tax and confounder
smoking = 10 - 5 * tax_rate - 2 * health_conscious + rnorm(n),
# Outcome: affected by smoking and confounder
# True causal effect of smoking = -3
health = 80 - 3 * smoking + 5 * health_conscious + rnorm(n, 0, 5)
)

Naive estimate: -4.91
IV estimate: -2.43
True effect: -3
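The two stages can be run by hand with `lm()` on the `iv_sim` data. This is a sketch: the by-hand second-stage standard errors are wrong, which is why practice uses `estimatr::iv_robust()` or `ivreg` instead.

```r
# Stage 1: regress the treatment on the instrument
stage1 <- lm(smoking ~ tax_rate, data = iv_sim)

# Stage 2: regress the outcome on the predicted treatment
stage2 <- lm(health ~ fitted(stage1), data = iv_sim)
coef(stage2)[2]  # the IV estimate of the effect of smoking

# Equivalent one-liner with correct standard errors:
# estimatr::iv_robust(health ~ smoking | tax_rate, data = iv_sim)
```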
Two Critical Assumptions

1. Relevance: the instrument must actually affect the treatment.
2. Exclusion restriction: the instrument must affect the outcome only through the treatment.
Finding valid instruments is hard. Many purported instruments fail the exclusion restriction.
| Method | Key Assumption | Data Requirement |
|---|---|---|
| DiD | Parallel trends | Before/after data for treatment and control groups |
| PSM | Selection on observables | Rich set of pre-treatment covariates |
| RDD | No manipulation; continuity | Running variable with cutoff |
| IV | Exclusion restriction | Valid instrument |
digraph D {
node [shape=box, fontname="helvetica"];
Q1 [label="Is there a\nnatural cutoff?"];
Q2 [label="Is there a\nvalid instrument?"];
Q3 [label="Is there before/after\ndata with control group?"];
Q4 [label="Can you measure\nall confounders?"];
RDD [label="RDD", shape=ellipse];
IV [label="IV", shape=ellipse];
DiD [label="DiD", shape=ellipse];
PSM [label="PSM", shape=ellipse];
Caution [label="Be very\ncautious", shape=ellipse];
Q1 -> RDD [label="Yes"];
Q1 -> Q2 [label="No"];
Q2 -> IV [label="Yes"];
Q2 -> Q3 [label="No"];
Q3 -> DiD [label="Yes"];
Q3 -> Q4 [label="No"];
Q4 -> PSM [label="Yes"];
Q4 -> Caution [label="No"];
}

# DAGs
library(ggdag) # Visualise and analyse DAGs
library(dagitty) # DAG algebra
# Difference-in-Differences
library(did) # Callaway and Sant'Anna estimator
library(fixest) # Fast fixed effects estimation
# Propensity Score Matching
library(MatchIt) # Matching methods
library(cobalt) # Balance assessment
# Regression Discontinuity
library(rdrobust) # Robust RDD estimation
library(rddensity) # Manipulation testing
# Instrumental Variables
library(estimatr) # iv_robust() function
library(ivreg) # 2SLS estimation

Week 13: Advanced Applications and Best Practices
Reading: TSwD Ch 10, 15, 17; ROS Ch 11.8, Appendix B